Modelling the lymphatic metastatic progression pathways of OPSCC from multi-institutional datasets

The elective clinical target volume (CTV-N) in oropharyngeal squamous cell carcinoma (OPSCC) is currently based mostly on the prevalence of lymph node metastases in different lymph node levels (LNLs) for a given primary tumor location. We present a probabilistic model for ipsilateral lymphatic spread that can quantify the microscopic nodal involvement risk based on an individual patient’s T-category and clinical involvement of LNLs at diagnosis. We extend a previously published hidden Markov model (HMM), which models the LNLs (I, II, III, IV, V, and VII) as hidden binary random variables (RVs). Each represents a patient’s true state of lymphatic involvement. Clinical involvement at diagnosis represents the observed binary RVs linked to the true state via sensitivity and specificity. The primary tumor and the hidden RVs are connected in a graph. Each edge represents the conditional probability of metastatic spread per abstract time-step, given disease at the edge’s starting node. To learn these probabilities, we draw Markov chain Monte Carlo samples from the likelihood of a dataset (686 OPSCC patients) from three institutions. We compute the model evidence using thermodynamic integration for different graphs to determine which describes the data best.The graph maximizing the model evidence connects the tumor to each LNL and the LNLs I through V in order. It predicts the risk of occult disease in level IV is below 5% if level III is clinically negative, and that the risk of occult disease in level V is below 5% except for advanced T-category (T3 and T4) patients with clinical involvement of levels II, III, and IV. The provided statistical model of nodal involvement in OPSCC patients trained on multi-institutional data may guide the design of clinical trials on volume-deescalated treatment of OPSCC and contribute to more personal guidelines on elective nodal treatment.


INTRODUCTION
When treating head and neck squamous cell carcinoma (HNSCC) with radiotherapy or surgery, the aim is to irradiate or resect as much of the malignant tissue Corresponding author: Roman Ludwig roman.ludwig@usz.chas possible.This includes the primary tumor mass and clinically detected lymph node metastases.However, to reduce the risk of locoregional failure, treatment also includes regions of the lymph drainage system of the neck with possible microscopic tumor spread, which in-vivo imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), or positron emission tomography (PET) cannot detect.This is referred to as elective nodal irradiation or prophylactic neck dissection.Treatment decisions regarding the CTV-N or the extent of neck dissection must balance the conflicting goals of treating regions at risk of occult lymph node metastases to avoid recurrences while avoiding toxicity related to unnecessary treatment of healthy tissues.
This work concerns itself with OPSCC, where approximately 70-80% of patients present with lymph node metastases at the time of diagnosis.In clinical practice, CTV-N definition in radiotherapy is based on guidelines (Grégoire et al. 2003(Grégoire et al. , 2014;;Grégoire & Others 2018;Eisbruch et al. 2002;Biau et al. 2019;Chao et al. 2002;Vorwerk & Hess 2011;Ferlito et al. 2009) that are mostly derived from the observed prevalence of involvement in an LNL for a given tumor location.These guidelines currently suggest extensive irradiation of both sides of the neck for most patients.In the ipsilateral neck, the CTV-N includes LNLs II, III and IV for all patients, and levels I and V for the majority of patients.These guidelines, however, do not account for the personal risk of the patients that may depend greatly on their state of tumor progression at diagnosis.E.g., a patient with macroscopic metastases detected via PET in both LNLs II and III may have a substantial risk for occult disease in LNL IV.Instead, patients who present with a clinically N0 neck or a single metastasis in LNL II may have a much smaller risk for occult disease in LNL IV.
We previously developed a model of lymphatic metastatic progression for estimating the risk of microscopic disease, given a patient's personal diagnosis.The initial model was based on the methodology of Bayesian networks (BNs) (Pouymayou et al. 2019).It was subsequently extended and formulated as an HMM to include T-category in an intuitive manner (Ludwig et al. 2021).However, these models were introduced based on only a small dataset of approximately 100 early T-category patients available at that time (Sanguineti & Others 2009).The limited data did not allow us to quantify the probability of metastases in the rarely involved LNLs I, V, and VII, nor did the data allow us to verify that the HMM is adequate to describe the dependence on lymph node involvement on T-category.In this paper, we extend the previous work (Ludwig et al. 2021) by making the following contributions: 1. We provide a HMM of ipsilateral lymph node involvement including all relevant LNLs, namely the levels I, II, III, IV, V, and VII.To determine the optimal underlying directed acyclic graph (DAG) we compare different graphs by calculating the model evidence through thermodynamic integration (TI).2. We collect a multi-centric dataset consisting of 686 patients from three institutions, allowing us to train the model based on a sizable dataset (Ludwig et al. 2022a(Ludwig et al. , 2023c)).
3. We use the trained model to provide personalized risk estimations for occult metastases for typical clinical states of tumor progression at diagnosis, illustrating its potential for guiding volumedeescalated treatment strategies in the future.

State of the Hidden Markov Model
We have introduced a probabilistic model for lymph node involvement based on Bayesian networks (BNs) in (Pouymayou et al. 2019).The model was extended using HMMs in (Ludwig et al. 2021).We will briefly recap the hidden Markov model to introduce the notation used throughout the work.
A patient's state of (hidden) lymphatic involvement at time t is described as a collection of binary RVs, one for each of the V LNLs: Each of the LNLs can be in the state X v = 0 (FALSE), meaning LNL v is healthy, or in the state X v = 1 (TRUE), indicating the LNL harbors metastases.The involved state includes occult disease.
The transition from one time-step to another is governed by the transition probability be collected into a transition matrix when we enumerate all 2 V distinct possible states ξ i with i ∈ 1, 2, . . ., 2 V of lymphatic involvement: The term P ξ i | ξ j describes the probability to transition from the hidden state of lymphatic involvement ξ j to the state ξ i between the time t and t + 1.Using a DAG as depicted in Figure 1, we can formulate this transition probability in the following way: (3) In this equation, we have denoted LNLs that are parents of LNL v with the symbol r ∈ pa(v).Also, ξ iv denotes the value that LNL v takes on when the patient is in state ξ i .The term Q(a; b) ∈ {0, 1} is there to prohibit self-healing.It is always one, except if LNL v is healthy in state ξ i , but was metastatic in the previous state ξ j .In that case the function becomes Q(0; 1) = 0, making the transition back to healthier states impossible.
The terms of the form P ξ iv | {ξ jr } r∈pa(v) implicitly depend on how we parameterize the arcs of Figure 1.For example, if we look at the probability of spread to LNL III (X 3 ) depending on the state of that level's parentwhich, in this case, is pa(3) = 2 -we can write the different combinations into a conditional probability table as below.
The variable b 3 denotes the probability of lymphatic spread from the tumor to LNL III during one time-step, and t 23 is the probability of spread from an involved level II further down the lymphatic chain into LNL III.

Diagnostic Observation
We also need to introduce a separate collection of RVs that describe the diagnostic observation of a patient's involvement.In analogy to the hidden true state X[t] at time t, we write this diagnosis as We do not need to differentiate between different times t here, since a patient is ever only diagnosed once, after which treatment usually starts timely.Consequently, the conditional probability to observe a diagnosis Z = ζ ℓ , given a hidden involvement state X = ξ k is a matrix B made up of products of terms from the table above: We define the time t = 0 to be the moment just before a patient's tumor formed, and hence X v [t = 0] = 0 ∀v.However, using this definition, we cannot know how many time-steps have passed until t D , when the patient was diagnosed with cancer.We can only make the assumption that a patients with an earlier T-category tumor was probably diagnosed after fewer time-steps than a patient with a advanced T-category tumor.
We can use this assumption by marginalizing over the diagnose times t D of patients in different T-categories using different prior distributions over the diagnose time.E.g., P (t = t D | early) for early T-category patients (T1 & T2) and P (t = t D | late) for advanced T-category patients (T3 & T4).Throughout this work we will use binomial distributions for these probability mass functions, mainly because they have a plausible shape for this purpose and only a single parameter.
Here, the parameter p Tx can be interpreted as the probability that the patient with a tumor of T-category x will be diagnosed at time-step t + 1 given they are in timestep t.We will use as the latest time-step t max = 10, which will therefore give us a distribution over the diagnosis time that has its mean at E[t D ] = 10 • p Tx .

The Likelihood Function
Using the definitions up to this point, we can compute a vector of likelihoods for every possible diagnosis: This likelihood implicitly depends on how we parameterize the arcs of the DAG underlying the model -see Equation 3-and the parameterization of the distribution over diagnosis times -e.g., as in Equation 6.
Together with the parametrizations of the distributions over the diagnosis time, the parameters b v and t vr that make up the transition matrix A comprises the set of model parameters: To infer these parameters from a dataset of N OPSCC patients D = (d 1 , d 2 , . . ., d N ), we compute the data loglikelihood: Which effectively amounts to computing the elementwise logarithm of the likelihood vector ℓ from Equation 7and summing up the entries that correspond to each of the patients d i for i ≤ N .Note that it is also possible to account for incomplete diagnoses, i.e., a diagnosis where the involvement information for one or more LNLs is missing.In that case, we can sum over those elements of ℓ that correspond to complete diagnoses which match the provided incomplete one.In this paper, for some patients involvement information of level VII was missing and hence marginalized over.A detailed explanation of this formalism can be found in (Ludwig 2023), section 6.2.7.
Using this log-likelihood function one may now employ a variety of inference methods to learn the parameters of the model that best describe the observed data.

Parameter Inference
We use Markov chain Monte Carlo (MCMC) sampling to draw parameter samples θi for i ≤ S from the likelihood described in Equation 9 (i.e. the unnormalized posterior distribution over the parameters θ, since we used a uniform prior in this work).1, red arcs are parametrized with probabilities of spread from the tumor (red circle) to the LNLs (blue circles).These red arcs, together with the blue arcs fron LNL to LNL, make up the base graph.One after the other, each of the green arcs was added to the base graph.Subsequently, the performance of the resulting models in terms of its BIC was compared to the base graph to assess whether the additional edge should be kept in the winning graph or not.
More specifically, we use the Python implementation emcee (Foreman-Mackey et al. 2013) and two sample proposal mechanisms based on differential evolution moves (ter Braak & Vrugt 2008;Nelson et al. 2013) for sampling.Instead of proposing and then accepting or rejecting individual parameter samples one after the other (as in the classical Metropolis-Hastings algorithm), the emcee implementation makes use of an ensemble of W so-called "walkers".This gives rise to W parallel chains of samples that mutually influence each others proposal such that the sampling proceedure overall is affine invariant.This means that scaling the parameter space along any dimension has no effect on the performance of the MCMC sampling algorithm.
For the experiments in this work, we used W = 20 • k walkers, where k is the dimensionality of the parameter space Θ.After an initial "burn-in" phase, during which all drawn samples are discarded because they are not yet independent of the initial state, we continued sampling for another 200 steps of which we discarded every 10th to be left with S = 20 • W samples.
These S parameter estimates are then used to compute expectation values of estimates that depend on the parameters θ through an integral over the parameter space Θ: These values were computed during a TI run for the loosing graph model (red line), as well as the winning graph model (green line).Note that the scaling of the x-axis is chosen such that the 64 points on the temperature ladder appear evenly spaced.This is to stress how the log-accuracy develops in the range from β = 0 to around β = 0.1.The area under these two curves yields an approximation of the log-evidence of the respective model.It is barely visible that due to the simplicity of the loosing graph model its accuracy increases earlier and faster that the winning graph model's.However, already before β = 0.031 the model based on the winning graph outperforms the loosing graph model.Ultimately, the winning graph's log-evidence is better than the loosing graph model's by a value of around 28.46 (see Table 6).
Alternatively, the individual fi = f θi can be used to plot histograms over the distribution of f .We will do so in section 5 to show distributions over prevalence predictions and risk computations.
Another relevant model parameter that needs to be set for the inference process, is the maximum number of time-steps we used for the evolution of the system.We set this value to t max = 10, such that t ∈ {0, 1, 2, . . ., 10}.The binomial "success probability" used to fix the shape of the early T-category's time-prior was set to p early = 0.3.

Risk estimation
The main task for personalizing the CTV-N definition is to predict the probability of the hidden possible states ξ k given the diagnosis d ⋆ = ζ ℓ of a new patient.Using Bayes' theorem, we get The described model along with the inferred parameters θ will yield an estimate (or multiple estimates) for the "prior" in the above equation P ξ k | θ .
From this probability for any possible hidden state, we can also compute the probability of, for example, involvement in LNL IV.To that end, we marginalize over all states ξ k where ξ k4 = 1, meaning those states in which LNL IV harbors metastases.Formally, we can define a marginalization vector m that is one for every hidden state we want to include in the marginalization and zero elsewhere.In the example of the marginalized probability for LNL IV involvement, the components would look like this: Subsequently, we can compute the marginalization as a dot product:

Investigating Spread Graphs
The DAG shown in Figure 1 includes the LNLs II, III, and IV, which represent the most relevant lymph node levels for OPSCC.It includes the arcs from II to III and from III to IV, representing the main direction of lymphatic drainage, which is well motivated anatomically and by the data on lymph node involvement.Previous publications (Pouymayou et al. 2019;Ludwig et al. 2021) focused on these levels because they relied on a limited reconstructed dataset of OPSCC patients (Sanguineti & Others 2009).Now, with the datasets available for this work, we can extend the graph to include random variables for all LNLs that are relevant for OP-SCC: I, II, III, IV, V, and VII.The main question to answer is: which arcs between LNLs are needed to accurately model the data on lymph node involvement without increasing the model's complexity unnecessarily.
First, we notice that the direct arcs from tumor to each of the LNLs must be present, since every LNL appears metastatic in isolation at least once in the dataset.For example, some patients presented with metastases in LNL I, while the other levels appeared healthy.If no spread was allowed from the tumor to X 1 (i.e., b 1 = 0), the likelihood of observing this patient would be zero.
We have more freedom in choosing how to connect the LNLs to each other.To investigate which connections to add we start by establishing a baseline from a model using a minimal base graph.It contains only the connections from LNL II to III and an arc from LNL III to IV, as motivated by the main lymphatic pathway (Lengelé et al. 2007).The base graph is illustrated in Figure 2 via the red and blue arcs.The lymphatic drainage to or from levels I, V, and VII is not as clearly defined.Therefore, we define a set of candidate arcs (green arcs in Figure 2) and use the model comparison methodology described in subsection 3.2 to determine which graph is most supported our data.
To connect level I, we investigate two candidate arcs: from I to II and from II to I.An arc from I to II was used in (Pouymayou et al. 2019;Ludwig et al. 2021) and is anatomically motivated.However, since LNL I is rarely involved compared to level II, the associated parameter t 12 is mostly undetermined.Therefore, we also consider the flipped arc from LNL II to I and investigate if it helps to describe the correlations between the involvement of levels I and II.
Anatomically, the posterior accessory pathway that drains LNL II through LNL V motivates investigating and arc from X 2 into X 5 (Lengelé et al. 2007).And, although no lymphatic pathway is described that directly drains LNL III or IV into level V, due to their proximity to each other, we will also investigate additional edges from the levels III and IV into LNL V. Finally, we look at adding an arc from LNL II to LNL VII also due to their anatomical proximity.
To determine the optimal graph, we first consider six models, each with one of the six green candidate arcs of Figure 2 added to the base graph.Every one of these models was evaluated by computing an approximation to its evidence via thermodynamic integration as described below in subsection 3.2.Subsequently, graphs combining multiple arcs that individually improve the model evidence are considered.Thereby, the "winning graph" is determined, which yields the highest (i.e., the least negative) value of the logarithm of the model evidence.

Model Comparison
The aim of this work is to refine the graph structure underlying our risk model introduced in the previous section.This DAG determines the number of parameters of the model as well as how exactly the transition matrix A is parameterized.To compare different models that are based on different DAGs, e.g.models M 1 and M 2 , in a Bayesian setting, we need to compute the Table 3.
Interpretation of Bayes factors and their natural logarithms in terms of their support for or against one of the two compared models as introduced by Jeffreys (1998).

K1v2
ln K1v2 support for M1 < 10 0 < 0 negative (supports M2) 10 0 to 10 probabilities of these models, given the data D: If we assume all models M i for i ∈ {1, 2} to have the same a priori probability -meaning in this case P (M 1 ) = P (M 2 ) -then we can compute the so-called Bayes factor of the two models as the ratio of their likelihoods.The interpretation of the values for different Bayes factors is given in Table 3.It is defined as follows: These likelihoods are commonly called the model evidence or marginal likelihood.The latter because computing it involves marginalizing the data likelihood over all model parameters: (16) However, this quantity is often very hard to compute or even intractable, due to the high dimensionality of the parameter space Θ.In our case, the number of dimensions ranges from k = 9 for the base graph to k = 11 for the winning graph.A brute-force integration over a unit cube with this many dimensions is inefficient and errorprone, which is why we resorted to TI for computing the (log-)evidence.
Below, we will briefly outline the main concept behind this algorithm.An intuitive and extensive derivation of TI is given by (Aponte et al. 2022).
We start by taking the logarithm of the model evidence E and subtract a zero from it in the form of the term 0 = ln p(θ | M)dθ.Further, we can multiply the distribution over the parameters θ inside this integral by 1 = P (D | θ, M) β=0 .Subsequently, we can write the logarithm of the evidence as an integral over a derivative: The derivatives of the log-evidences ln E β are essentially expectation values of the data log-likelihood under the power posteriors of the corresponding value for β.They can be computed using MCMC: The integral in Equation 17 can then be computed via a trapezoidal rule using the A MC to yield a numerical approximation of the model evidence: (19) This estimate gets better for more samples S per sampling from the power posterior p β but more importantly it gets better for a tighter spacing of the values for β within the interval [0, 1].The variable β is also often referred to as an inverse temperature, due to its origins Table 5. Prevalence of involvement patterns in the multicentric dataset.An involvement pattern is characterized by the state of the six LNLs: A red dot means the LNL was reported to be metastatic, a green dot means it was determined to be healthy and a question mark means that the prevalence was marginalized over the state of this LNL.
For the TIs that were performed in this work we used such a fifth order power rule with 64 steps, meaning that R = 63.
The process of computing the log-evidence using TI was as follows: We randomly initialized the starting positions of the W samplers in the ensemble within the k dimensional unit cube Θ.Subsequently, for each j ∈ {0, 1, 2, . . ., R = 63} we drew samples from the corresponding power posterior with the value of β j set according to the power rule in Equation 20.This sampling at point j consisted of 1000 burn-in steps, followed by 200 steps, of which only every tenth was kept.The last position of the W chains for the j-th β value in the ladder was used to initialize the subsequent sampling round with β j+1 .Hence, after the computations are finished, we are left with S = 20 • W samples θi,j and respective log-likelihood li,j from each of the 64 power posteriors corresponding to the respective β j .Subsequently, we numerically integrated the following quantity S times: And then computed the mean and standard deviation of all the integrated ln Êi .We then used this for the log-evidence and its error.
Without derivation or insight, we would like to mention that the model evidence naturally balances a model's accuracy against its complexity.The value of ln E will generally be larger (i.e., less negative) if a model fits the data better than another while being similarly complex.On the other hand, if e.g.additional parameters are introduced without sufficiently improving how well the model explains the data, the evidence will penalize the increase in complexity.
An approximation to the evidence that also attempts to balance accuracy and complexity against each other is the heuristic called BIC.The negative one half of the BIC approximates the ln E via Lagrange's method (Bhat & Kumar 2010) and yields an easy to compute estimate that may also be used to compare models, as long as its underlying assumptions are valid: Here, L = max θ (ln P (D | θ)) is the maximum loglikelihood.The approximation is good, when the posterior distribution over the parameters p (θ | D) is singlemodal and falls quickly to zero from the maximum.Also, the number of data points N needs to be much larger than the number of parameters k.We will see that for the models we consider here, the BIC is generally a good approximation and the conclusions drawn from comparing models using this metric can be reproduced reliably using the true model evidence computed with TI.

Reproducibility
The entire methodology used in this work is publicly available in the GitHub repository rmnldwg/lynference.Tagged references to specific versions of an inference pipeline allow reproducing the inferred parameters of the models described here.Every parameter necessary to reproduce such a pipeline is specified there in designated configuration files.It also defines the sequences of computations that constitute the pipeline that are mostly calls to commands of the lyscripts we published.
All figures in this work, like the risk predictions and prevalences we show histograms of, can be reproduced following the instructions of the GitHub repository underlying this publication rmnldwg/graph-extension-paper.It is based off of the showyourwork project that aims to make scientific papers easily and fully reproducible.

MULTICENTRIC DATASET
The dataset D that we used for inference is comprised of the detailed reports on lymph node involvement patterns in OPSCC patients treated at three different institutions in France and Switzerland: The Centre Leon Bérard (CLB) in Lyon (France), the Inselspital Bern (ISB) in Bern (Switzerland), and the University Hospital Zurich (USZ) in Zürich (Switzerland).We have previously published the patterns of nodal involvement for the USZ cohort (287 patients) (Ludwig et al. 2022a(Ludwig et al. , 2023d) ) and described its characteristics in detail (Ludwig et al. 2022b).The first CLB dataset (263 patients) (Ludwig et al. 2022c) underlies a publication on human papilloma virus (HPV) status in OPSCC (Bauwens et al. 2021) and is made available in a separate "Data in Brief" article alongside the second dataset from France (Ludwig et al. 2023a) and the lymphatic progression patterns from the ISB (Ludwig et al. 2023b,c).All datasets may be explored online in our web-based interface LyProX.
In total, the dataset contains 686 patients with newly diagnosed OPSCC.It includes patients treated with definitive (chemo)radiotherapy, adjuvant (chemo)radiotherapy following neck dissection, or neck dissection alone.Pathologically assessed LNL involvement was available for 263 surgically treated patients, while for the remainder the nodal involvement was assessed based on available diagnostic modalities (FDG-PET-CT, CT, MRI, FNA).If multiple modalities were used to diagnose a patient's lymph node involvement, the available modalities were combined into a consensus decision.When different modalities were conflicting, the conflicts were resolved by inferring the most likely state (healthy or metastatic) for each LNL separately.To do so, we used literature values for the sensitivity and speci-ficity of the diagnostic modalities (De Bondt & Others 2007;Kyzas & Others 2008), which we also tabulated in Table 4. Practically, this means that, whenever pathology after neck dissection was available, the pathology result was taken as the consensus, overruling any other clinical diagnostic modality.If, for example, PET-CT and MRI ware available and conflicting, PET-CT was taken as the consensus, overruling MRI.
The dataset containing the consensus decision for the involvement of each level in every patient was then used for model parameter learning.We assumed that it represents an observation of the true hidden state ξ k .The frequencies of some of the most important combinations of involved lymph node levels are listed in Table 5.

Involvement of Levels II, III, and IV
For the base graph, we have plotted the predicted prevalence of involvement patterns in the investigated patient cohort for scenarios involving the most commonly metastatic LNLs II, III and IV in Figure 4.It shows -for each pattern of lymphatic involvementtwo plots overlaid: 1.The colored histograms over the base graph model's prediction for the prevalence of the respective pattern of involvement.These histograms are obtained by computing the same prevalence with different samples from the inference process, thus providing us with a measure of uncertainty for the prediction.
2. Colored lines, depicting the beta posterior over the same involvement pattern's prevalence, given a uniform beta prior and the binomial likelihood of the observed data.The maximum of the beta distributions always coincides with the data prevalence but we additionally gain an intuition into how statistically significant the data is.E.g., Observing 3 out of 10 patients with a particular pattern of nodal metastases is less convincing than 300 out of a cohort of 1000 patients.A beta posterior over these prevalences reflects that in its variance.
Figure 4 shows that this minimal graph is already capable of describing the most important parts of the observed data very well.Notably, the model is not only accurate in its predictions, it also correctly estimates the variance stemming from the limited amount of data.The separation between involvement prevalences of early and advanced T-category tumors is also reproduced well by the model.This is remarkable because the model Table 6.Model comparison results from the base graph and the extended graph we chose as the "winning" model.For both DAGs we show the log-evidence, computed via thermodynamic integration, the negative one half of the BIC, as well as the maximum log-likelihood that was encountered during the final MCMC sampling round.graph log-evidence introduces only a single parameter to describe the differences between early and advanced T-category for all involvement patterns.This shows that expecting later diagnosis times, on average, for patients with advanced T-category tumors can explain more severe lymphatic involvement.
It is interesting to note that not all involvement patterns become more prevalent with advanced Tcategory.For example, a healthy LNL III together with a metastatic level II is observed slightly less often for advanced T-category tumors compared to early T-category (yellow histogram, row 1 versus 2).This is because, for advanced T-category, it is more likely the disease has already spread to LNL III (blue histogram, row 1 versus 2).Our model captures this accurately and precisely.

Comparison of candidates graphs
The model evidences of all candidate graphs are reported in Table 6.A visual ranking is provided in Figure 5.Let us first consider the six models in which one of the candidate arcs is added to the base graph.Evidently, adding a connection from LNL I to II is strongly supported given this dataset, and is slightly superior to the reverse connection from LNL II to I. In addition, there is strong evidence for introducing an arc from LNL IV to V. Furthermore, there is substantial evidence for an arc from LNL III to V. All other investigated additions lead to improvements that are barely worth a mention or do not justify the additional complexity at all, indicated by a lower evidence.
Based on these results, we consider three additional candidates for the optimal graph that combine the added arc from LNL I to II with the arc(s) III → V and/or IV → V.The model evidence for these three graphs is also reported in Table 6.The best performing graph with decisive evidence over the base graph turned out to be the one which combines the arcs from LNL I to II and IV to V. Interestingly, the evidence gain of this "winning graph" is roughly the sum of the gains seen in the two candidates where only one of these connections was added, respectively.This indicates that the two additional parameters are largely independent of each other and manage to describe different aspects of the data.
Table 6 additionally shows the evidence of graphs in which the arc from level II to III or from level III to IV is removed.The low model evidence for these graphs confirms the importance of these connections and is consistent with the anatomical motivation.The connection from level III to IV is crucial for describing the observa-tion that metastases in level IV are extremely rare without simultaneous involvement of the upstream level III.

The winning graph
The most likely model parameters for the winning graph, corresponding to the mean of the marginals of the sampled posterior distributions, are tabulated in Table 7.We have fixed p early = 0.3 for early T-category tumors (i.e., T0, T1, and T2), and t max = 10 time steps.The result that t II→III and t III→IV are relatively large compared to b III and b IV reflects the observation that skip metastases in levels III and IV without involvement of the upstream level are rare.Since level II's parent node (level I) is rarely involved, b II can approximately be related to the prevalence of involvement in level II.The probability for no involvement of level II when the patient is diagnosed after t time steps is (1 − b II ) t .The prevalence of level II involvement for advanced T-category patients is thus which agrees with the second panel from the top in Figure 4.The large value for the parameter t I→II reflects the observation that in almost all patients with level I involvement, level II is also involved.The large uncertainty in t I→II is related to the fact that level I involvement is rare compared to level II.

Involvement of levels I and V
We can observe that the winning graph describes the involvement of levels II, III, and IV equally well as the base graph, a result that is expected and not further shown.We thus focus on the improvements w.r.t.involvement patterns that include the LNLs I and V, that more rarely harbor metastases.
Level V: In Figure 6 we compare the base graph's and the winning graph's estimations for prevalences of involvement patterns that include LNL V.The base graph underestimates the probability that level IV and V are simultaneously involved, and overestimates the probability that level V but not IV is involved.By introducing the arc from level IV to V, the winning graph can describe the observation that level V involvement is typically associated with severe involvement of level II-IV.In the dataset, 14 patients out of 62 patients with level IV involvement have metastases in level V (22 %).Instead, only 40 patients out of 624 patients without level IV involvement have metastases in level V (6 %).
Level I: In Figure 7, analogous comparisons are shown for involvement patterns that include LNL I.The base graph overestimates the probability of level I involvement without simultaneous involvement of level II.By introducing the arc from level I to II, the winning graph can capture the correlations between levels I and II.It can also be noted that both models overestimate level I involvement for early T-category patients and underestimate level I involvement for advanced T-category patients.This is further described in the discussion section below.

Risk Prediction for Occult Disease
In this section, the model corresponding to the winning graph is applied to estimating the risk of occult metastases in clinically negative LNLs.We assume a sensitivity of 0.76 and a specificity of 0.81 for the clinical diagnosis of lymph node metastases, corresponding to CT imaging in Table 4.
Level II: As can be seen in Table 7, spread from the tumor to LNL II to be the most probable transition at any given time step.As a consequence, even for an early T-category patient that presents with a clinical N0 neck, our model predicts a 31.13 % ± 1.78 % risk for microscopic metastases in LNL II.
Level III: Figure 8 compares the risk of occult disease in level III between patients that are clinically N0     Level V: The right panels in Figure 9 show the risk of occult disease in level V depending on T-category and the clinical involvement of levels II-IV.For clinically N0 patients, the risk in level V is estimated to be just above 1%.Extensive nodal involvement of levels II-IV increases the risk in level V to more than 4% for advanced T-category tumors.
Level I: Figure 10 shows the risk of occult disease in level I depending on T-category and the clinical involvement of levels II-IV.For clinically N0 patients, the risk in level I is estimated to be in the order of 1-2%.Extensive nodal involvement of levels II-IV increases the risk in level I to just below 4% for advanced T-category tumors.It is pointed out that the winning graph does not contain arcs from levels III or IV to LNL I (and anatomically we do not assume that there is lymphatic drainage from levels III or IV to level I).Thus, the increased risk in level I is related to the time evolution: Getting diagnosed at a later time during the disease's evolution probably correlates with more advanced nodal metastasis.And involvement in the levels III and IV corresponds to a more advances state of disease that is likely diagnosed at a later time step, such that the tumor also had more time to spread to level I.The correlation between the clinical involvement pattern, the likely time of diagnosis in the tumor's time frame based on it, and from that the risk of involvement is another benefit of the formulation of the model as a HMM.
To illustrate the flexibility of the model in predicting various risks, we have plotted the risk for occult disease in any of the LNLs I, IV, and/or V, given different clinical diagnoses in Figure 11.Similar to this, we may compute the risk for an arbitrary combination of involved levels, given a similarly arbitrary clinical diagnosis.For the base graph (base-graph-v2) and the winning graph (win-graph-v3), one may also interactively explore these risks in our web-based interface LyProX, similar to how it is possible to explore the underlying data in an interactive way.In this publication we present a statistical model of ipsilateral lymph node involvement in oropharyngeal SCC patients.Although the basic HMM of lymphatic progression has been conceptually introduced in a previous work (Ludwig et al. 2021), this is the first publication that evaluates the model based on a large multi-institutional dataset containing 686 patients.It is demonstrated for the first time that the model can accurately describe the patterns of lymph node involvement observed in the data, including the correlations between levels and its dependence on T-category.Furthermore, techniques from statistical physics are applied to calculate the model evidence for Bayesian model comparison.This yields a complete model including all LNLs relevant for OPSCC: I, II, III, IV, V, and VII with a parameterization that balances accuracy and model complexity.

Implications for elective nodal treatment
Risk predictions obtained by the model may be used to design clinical trials on volume-deescalated treatment of OPSCC.In the context of radiotherapy, this corresponds to excluding LNLs from the CTV-N, which are irradiated according to the current guidelines.The list below should be seen as a summary of subsection 5.5 and the limitations discussed in subsection 6.3 should be taken into account in its interpretation.Assuming that one accepts approximately a 5% risk of occult metastases per LNL, the statistical model presented in this paper would suggest to: • Irradiate level II for all patients.
• Irradiate level III for most patients.Only for clinically N0, early T-category patients, not irradiating level III could be considered.
• Exclude level IV from the CTV-N for patients with clinically negative level III.For advanced Tcategory patients with involvement of level III, level IV should be irradiated.
• Exclude level V from the CTV-N for most patients.
Only for patients with extensive involvement of levels II, III, and IV, irradiation of level V can be considered.
• Exclude level I from the CTV-N for early Tcategory patients with limited metastatic disease.For advanced T-category patients with extensive nodal involvement of levels II, III, and IV, level I should be irradiated (see also the limitations discussed in subsection 6.3).
• Exclude level VII in all patients, unless it is clinically involved.

Limitations and future work
T-category dependence of level I involvement: As shown in Figure 4, the model describes the involvement of LNLs II, III, and IV depending on T-category very well despite having only a single parameter related to T-category. Figure 7 shows that the model does not perfectly describe the T-category dependence of level I.It adjusts the parameters such that level I involvement is correctly described for the set of all patients combined, but it overestimates level I involvement for early T-category and underestimates it for advanced Tcategory.A possible explanation is that advanced Tcategory tumors are more likely to have grown into regions with direct lymph drainage to level I.The more severe involvement in levels II, III, IV for advanced Tcategory can be explained by tumors having more time to spread while keeping the spread probability rates b 2 , b 3 , b 4 constant.Regarding level I, early versus advanced T-category tumors the model may need different spread probability rates b 1 to describe their involvement correctly.
Sensitivity and Specificity: Estimating the risk of occult metastases depends on the assumed parameter values for sensitivity and specificity of clinical detection of lymph node metastases.For this work, we adopted literature values for sensitivity and specificity.However, different authors have estimated these values using different criteria and different methods.Consequently, these values need to be considered with caution.Figure 12 illustrates for one example how the risk of occult disease depends on sensitivity and specificity.Here, we consider the risk in level IV in patients with clinically detected metastases in levels II and III.For our default parameters of 81% specificity and 76 % sensitivity, the risk is 5%, but it increases to around 8% for a a sensitivity of 66%.
Also, as described in section 4, we assumed the consensus of the data to represent the true state of nodal involvement.This was not strictly necessary: Instead of computing a consensus beforehand and providing that with sensitivity and specificity of 1 to the model, as if it were the ground truth, we could have provided multiple diagnostic modalities per patient to the model directly.In fact, for patients with a pathology report available, this would even yield the same results.But we also decided to consider the consensus as an observation of the true hidden state for patients without pathologically assessed involvement.We did this because the literature values for sensitivity and specificity of around 80% do not plausibly match the observation that around 78% of patients in the USZ cohort showed clinical LNL II

Figure 1 .
Figure 1.DAG representing a possible abstraction of the lymphatic network comprising the tumor (red shaded circle) and LNLs II through IV as hidden binary RVs (blue outlined circles).Attached to each of these is the corresponding observed RV (orange shaded squares).Lymphatic flow is depicted in the form of parameterized arrows (red and blue) that represent the probability of spread along the respective arc per time-step.Sensitivity and specificity (orange arrows) connect the hidden RVs to the diagnosis.

Figure 2 .
Figure 2. Extended DAG representing different possible spread graphs underlying the HMM.As in Figure1, red arcs are parametrized with probabilities of spread from the tumor (red circle) to the LNLs (blue circles).These red arcs, together with the blue arcs fron LNL to LNL, make up the base graph.One after the other, each of the green arcs was added to the base graph.Subsequently, the performance of the resulting models in terms of its BIC was compared to the base graph to assess whether the additional edge should be kept in the winning graph or not.

Figure 3 .
Figure3.The expected log-accuracy under the power posterior plotted against the value of the inverse temperature β.These values were computed during a TI run for the loosing graph model (red line), as well as the winning graph model (green line).Note that the scaling of the x-axis is chosen such that the 64 points on the temperature ladder appear evenly spaced.This is to stress how the log-accuracy develops in the range from β = 0 to around β = 0.1.The area under these two curves yields an approximation of the log-evidence of the respective model.It is barely visible that due to the simplicity of the loosing graph model its accuracy increases earlier and faster that the winning graph model's.However, already before β = 0.031 the model based on the winning graph outperforms the loosing graph model.Ultimately, the winning graph's log-evidence is better than the loosing graph model's by a value of around 28.46 (see Table6).

Figure 4 .
Figure 4. Prevalence of involvement as predicted by the base graph model for different scenarios involving the most commonly metastatic LNLs II, III and IV (shaded histograms).The model's predictions are compared to Beta posteriors over the prevalence based on the frequency of the same scenarios and a uniform prior (slid lines).The top panel shows some selected scenarios with early T-category tumors and the bottom panel with advanced T-category.

Figure 5 .
Figure 5. Visual ranking of the investigated graphs w.r.t.their model evidence, computed via thermodynamic integration.Not shown are the two graphs where the arc from LNL III to IV was removed.Their respective model evidence fell below −1528 and the two graphs would appear far left in the figure.In the bottom left corner, we provide a visual reference in analogy to Table3: E.g., any difference in the model evidence shorter than the first of the three rulers indicates that the improvement is "barely worth a mention".

Figure 6 .
Figure6.Observed (Beta posteriors as lines) vs. predicted (histograms) prevalences of involvement combinations that include LNL V. We have plotted the predictions from the winning graph (colored histograms) and those of the base graph (gray, hatched histograms).The top two panels show scenarios for early T-category patients, the bottom two panels for advanced T-category.The left two panels consider combinations of LNL III and V involvement, while the right two panels consider combinations of LNL IV and V.The colored lines show the Beta posterior over the prevalence of the respective involvement pattern, given the data.

Figure 7 .
Figure 7.Comparison of observed and predicted prevalences of LNL I and II involvement patterns.The top and bottom panels show the prevalences for early and advanced T-category, respectively.The solid lines are Beta posteriors from the data, while the histograms are predicted prevalences (colored: winning graph, gray-hatched: base graph).Blue and red plots indicate overall LNL I and II involvement, respectively.Green plots indicate LNL I involvement without level II, while orange plots indicate the opposite (LNL II without level I).The winning graph has an added edge from LNL I to II, which improves the prediction of the rare green pattern.

(
orange) and patients with clinically diagnosed involvement of only level II (red), for early T-category (upper panel) and advanced T-category (bottom panel).The histograms represent the uncertainty in the model's risk prediction arising from the uncertainty in the model parameters and are generated by randomly drawing a tenth of the samples from the model parameter's joint posterior distribution.This amounts to S = 20 • W samples,

Figure 8 .
Figure 8. Histograms over the risk for microscopic involvement in LNL III, given that a patient presents as clinically N0 (green), or given that LNL II shows clinical involvement (blue).The top panel shows these risks for early T-category, the bottom panel for late.

Figure 9 .
Figure 9. Distributions over the risk for microscopic involvement in LNL IV (left panels) and in LNL V (right panels) as predicted by the winning graph model, given early (top panels) or advanced T-category (bottom panels), and different CTbased diagnoses: 1) A clinical N0 patient (green histograms), 2) visible metastases in LNL II, but otherwise healthy-looking lymph nodes (blue histograms), 3) macroscopic metastases in the LNLs II & III, with the rest of the neck still being clinically node negative (orange histograms), and finally 4) extensive clinical involvement in the levels II, III, and IV (red histograms).levelII, the risk in level III increases to approximately 9% and 12%, respectively.Level IV: Figure9(left panels) compares the risk of occult disease in level IV for the typical clinical presentations: clinically N0 (green), metastases in level II (blue), and metastases in levels II and III (orange).The model predicts a low risk of 1-2% in level IV for patients with clinically healthy level III.For patients with clinical involvement of level III, the risk of occult disease in level IV increases to approximately 3% for early T-category and 5% for advanced T-category tumors.Level V: The right panels in Figure9show the risk of occult disease in level V depending on T-category and the clinical involvement of levels II-IV.For clinically N0 patients, the risk in level V is estimated to be just above 1%.Extensive nodal involvement of levels II-IV increases the risk in level V to more than 4% for advanced T-category tumors.Level I: Figure10shows the risk of occult disease in level I depending on T-category and the clinical involvement of levels II-IV.For clinically N0 patients, the risk in level I is estimated to be in the order of 1-2%.Extensive nodal involvement of levels II-IV increases the risk in level I to just below 4% for advanced T-category tumors.It is pointed out that the winning graph does not contain arcs from levels III or IV to LNL I (and anatomically we do not assume that there is lymphatic drainage from levels III or IV to level I).Thus, the increased risk

Figure 10 .
Figure 10.Distributions over predicted risk for involvement in LNL I, given different clinical diagnosis scenarios: For N0 patients (green), patients with macroscopic involvement in LNL II (blue), and for the case where the LNLs II, III, and IV show involvement.The top panel shows these risks for early T-category and the bottom row for advanced T-category.

Figure 11 .
Figure 11.Shown are the histograms over the predicted risk for involvement in any of the LNLs I, IV, V, or VII.The risk is plotted given a clinical N0 diagnosis (green), macroscopically detected metastases in LNL II (blue), and lastly given visible involvement in both LNL II and III (orange).The top row shows these risks for early T-category diagnoses and the bottom row for advanced T-category.

Figure 12 .
Figure 12.Dependence of risk of occult disease on sensitivity and specificity.The figure shows the risk of involvement in level IV, given the mean of the parameter samples, drawn during sampling, in patients with clinical involvement of levels II and III.

Table 2 .
P (Z | X) for all possible values of X and Z.
Diagnosis and true state of a patient are formally connected via the sensitivity s N and specificity s P of the used diagnostic modality.In clinical practice, these modalities are CT, MRI, or PET scan, but it may also include in- formation from biopsies after a fine needle aspiration (FNA) or other techniques to detect lymphatic metastases.For each LNL v the conditional probability table of P (Z v | X v ) looks like this:

Table 4 .
(De Bondt & Others 2007;Kyzas & Others 2008)s that we used to infer the most likely involvement for a patient when multiple diagnostic modalities reported conflicting nodal involvement(De Bondt & Others 2007;Kyzas & Others 2008).

Table 7 .
Mean and standard deviation of parameters sampled for the winning graph in percent.asdescribed in subsection 3.2.The model predicts a risk of just below 6% for early T-category tumors and 8% for T3 or T4 ones.For patients with involvement of